Speeding Up Pattern Matching by Text Compression

نویسندگان

  • Yusuke Shibata
  • Takuya Kida
  • Shuichi Fukamachi
  • Masayuki Takeda
  • Ayumi Shinohara
  • Takeshi Shinohara
  • Setsuo Arikawa
چکیده

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression. In this paper, we bring out a potential advantage of BPE compression. We show that it is very suitable from a practical view point of compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly. We compare running times to find a pattern in (1) BPE compressed files, (2) Lempel-ZivWelch compressed files, and (3) original text files, in various situations. Experimental results show that pattern matching in BPE compressed text is even faster than matching in the original text. Thus the BPE compression reduces not only the disk space but also the searching time.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

More Speed and More Compression: Accelerating Pattern Matching by Text Compression

This paper addresses the problem of speeding up string matching by text compression, and presents a compressed pattern matching (CPM) algorithm which finds a pattern within a text given as a collage system 〈D,S〉 such that variable sequence S is encoded by byte-oriented Huffman coding. The compression ratio is high compared with existing CPM algorithms addressing the problem, and the search time...

متن کامل

Speeding Up String Pattern Matching by Text Compression: The Dawn of a New Era

This paper describes our recent studies on string pattern matching in compressed texts mainly from practical viewpoints. The aim is to speed up the string pattern matching task, in comparison with an ordinary search over the original texts. We have successfully developed (1) an AC type algorithm for searching in Huffman encoded files, and (2) a KMP type algorithm and (3) a BM type algorithm for...

متن کامل

Pattern Matching Machine for Text Compressed Using Finite State Model

The classical pattern matching problem is to nd all occurrences of patterns in a text. In many practical cases, since the text is very large and stored in the secondary storage, most of the time for the pattern matching is dominated by data transmission of the text. Therefore the text compression can speed-up the pattern matching. In this framework it is required to develop an e cient pattern m...

متن کامل

Speed-up of Aho-Corasick Pattern Matching Machines by Rearranging States

This paper describes speed-up of string pattern matching by rearranging states in Aho-Corasick pattern matching machine, which is a kind of afinite automaton. We realized speed-up of string pattern matching using data compression. Although we obtain higher compression ratio using a finite state model, it doesn’t lead speed-up of string pattern matching. Because the pattern matching machine beco...

متن کامل

Byte pair encoding : a text compression scheme that accelerates pattern matching

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression. In this paper, we bring ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000